Mozilla is planning to invest more substantially in privacy-preserving machine learning models for applications such as recommending personalized content and detecting malicious behaviour. As such solutions move towards production, it is essential for us to have confidence in the selection of the model and its parameters for a particular dataset, as well as an accurate view into how it will perform in new instances or as the training data evolves.
While the literature contains a broad array of models, evaluation techniques, and metrics, their choice in practice is often guided by convention or convenience, and their sensitivity to different datasets is not always well-understood. Additionally, to our knowledge there is no existing software tool that provides a comprehensive report on the performance of a given model under consideration.
The eventual goal of this project is to build a standard set of tools that Mozilla can use to evaluate the performance of machine learning models in various contexts on the basis of the following principles:
# Basic Computations
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'
import timeit
# ML models
from sklearn import metrics
from sklearn.model_selection import train_test_split
# Dynamic Markdowns
from IPython.display import Markdown as md
# Add the modules sub-directory to Python's path
import os
import sys
sys.path.insert(0, os.path.abspath('../elie_wanko/modules'))
import helpers, knn, logreg
Attributes Information:
This research employs a binary variable, C24: default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
C6 - C11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:
The measurement scale for the repayment status is:
- -1 = pay duly;
- 1 = payment delay for one month;
- 2 = payment delay for two months;
- . . .;
- 8 = payment delay for eight months;
- 9 = payment delay for nine months and above.
Data Preview
Our DataFrame has 30,000 rows and 25 columns. All values are non-null and numerical (int64). The sample table below shows the first 5 columns of our data set, and the next shows some basic information about every column (column index, column name, non-null count and data type).
df_data = pd.read_csv("../../datasets/defaults.csv")
df_data
From here we can already see that the "id" column merely repeats the index and carries no meaningful information, so it is useless for building our model. We will drop it in the upcoming data cleaning step.
The data description below gives a first impression, but it is not yet clear what the numbers tell us. For this reason, we will create some univariate and bivariate explorations to better understand our data.
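As a minimal sketch of how such redundancy could be checked, the snippet below tests whether an "id" column is unique and simply counts the rows. The toy frame and the assumption that ids start at 1 are illustrative only; the real file is not read here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_data (values are made up for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "limit_bal": [20000, 120000, 90000, 50000],
})

# The column adds nothing if it is unique and just counts rows.
# Assumption: ids start at 1 and follow the row order.
id_is_unique = toy["id"].is_unique
id_mirrors_index = bool((toy["id"].to_numpy() == np.arange(1, len(toy) + 1)).all())

print(id_is_unique, id_mirrors_index)
```

If both checks hold, dropping the column loses no information.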
df_data.describe()
df_data.info()
Data Cleaning
# Drop column 'id'
df_data = df_data.drop(columns='id')
# Rename C6 from pay_0 to pay_1 for consistency
df_data = df_data.rename(columns={"pay_0": "pay_1"})
df_data.to_csv("defaults_data.csv", index=False)
Observations:
The main feature of the data set is "C24: defaulted", which tells us whether a customer defaulted or not. Variables C1 to C23 support our investigation into this feature of interest. Some basic statistical details and questions we can pose include:
Next, we investigate further using various distributions to discover more insights and check whether our assumptions hold.
df_data.describe()
In this section, we investigate distributions of individual variables. We note any unusual points or outliers, clean things up and prepare to look at relationships between variables.
helpers.univ_bar(data=df_data, column='sex', x_title='sex', hue='defaulted',
var_names=['Male', 'Female'], title="Distribution of men and women that defaulted.")
(1 = male; 2 = female)
The plot shows more defaulters among women than among men. Despite this, the ratio of non-defaulters to defaulters in each category is about 7.8:2.2. We will need to investigate further to find the reasons behind this.
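Raw counts and within-group rates can tell different stories, so it may help to compute both. The sketch below uses `pd.crosstab` on a small made-up frame (the numbers are not taken from our data set) to show a group with more defaulters in absolute terms but a lower default rate.

```python
import pandas as pd

# Made-up toy data: 4 clients with sex=1 and 10 clients with sex=2.
toy = pd.DataFrame({
    "sex": [1] * 4 + [2] * 10,
    "defaulted": [1, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Absolute number of clients per (sex, defaulted) combination.
counts = pd.crosstab(toy["sex"], toy["defaulted"])

# Share of defaulters within each sex (each row sums to 1).
rates = pd.crosstab(toy["sex"], toy["defaulted"], normalize="index")

print(counts)
print(rates.round(2))
```

In this toy example group 2 has more defaulters (2 versus 1) but a lower default rate (0.20 versus 0.25), which is why the per-category ratio is worth reporting alongside the bar chart.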
helpers.univ_bar(data=df_data, column='marriage', x_title='marriage', hue='defaulted', fig_w=15,
title="Distribution of clients that defaulted based on their marital status.")
(1 = married; 2 = single; 3 = others)
From the plot, clients that are single are more at risk of defaulting. This group is closely followed by married clients; we could assume that the difference arises because married clients support each other financially in the long run.
helpers.univ_bar(data=df_data, column='education', x_title='education', hue='defaulted', fig_w=15,
title="Distribution of clients that defaulted based on their level of education.")
(1 = graduate school; 2 = university; 3 = high school; 4 = others)
Our graph shows that clients in university, closely followed by those in graduate school, are more likely to default. This could be caused by accumulated loan payments for school fees.
helpers.univ_hist(data=df_data, column='age', fig_w=20, bins=20, title="Distribution of clients age.")
With a right-skewed distribution, we can see that clients in their late 20s and early 30s are most at risk of defaulting. This makes sense, as a similar pattern was observed in the previous graph depicting defaults per level of education.
helpers.univ_hist(data=df_data, column='limit_bal', fig_w=20, bins=20, title="Distribution of clients account balance.")
Here the graph depicts that most of our clients are defaulters. However, we are quite limited in knowing who exactly they are and why they defaulted. To investigate the reasons behind this further, we will explore a few bivariate distributions.
In this section, we investigate relationships between pairs of major variables ("limit_bal", "sex", "education", "marriage", "age") and the financial status ("pay_1", "bill_amt1", "pay_amt1") of clients' accounts in the month of September in our data.
sns.pairplot(df_data, kind='reg', hue="defaulted", vars=['limit_bal', 'sex', 'education', 'marriage', 'age'], plot_kws={'scatter_kws': {'alpha': 0.5}})
sns.pairplot(df_data, kind='reg', hue="defaulted", vars=['pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6',
'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6',
'pay_amt1', 'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6'],
plot_kws={'scatter_kws': {'alpha': 0.5}})
We can observe a similar trend across all of the following pairwise features: history of past payment ("pay"), amount of bill statement ("bill_amt") and amount of previous payment ("pay_amt"). This indicates that we could apply Principal Component Analysis (PCA) as a dimensionality-reduction technique and uncover the variance-covariance structure of these variables through linear combinations. PCA can help improve algorithm performance, reduce overfitting, and make high-dimensional data easier to visualize and understand. However, caution is needed: reducing the dimensionality too much makes the independent variables less interpretable and loses information. Moreover, we must standardize our data before applying PCA; otherwise the components will be dominated by the variables with the largest scales.
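A minimal sketch of standardizing and then applying PCA, using scikit-learn's `StandardScaler` and `PCA` in a pipeline. The synthetic matrix `X` (six correlated columns on very different scales) and the 95% variance target are assumptions chosen for illustration, not values tuned on our data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: six strongly correlated columns on very different scales.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
scales = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0])
X = (base + rng.normal(scale=0.1, size=(200, 6))) * scales

# Standardize first, then keep as many components as needed
# to explain 95% of the variance (an illustrative target).
pca_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_reduced = pca_pipeline.fit_transform(X)

explained = pca_pipeline.named_steps["pca"].explained_variance_ratio_
print(X_reduced.shape, explained.round(3))
```

Putting the scaler inside the pipeline means the same scaling learned on the training data is reused when transforming test data.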
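We need to prepare our data in the correct input format for the sklearn models considered. To do so, we split our data into independent and dependent (target) attributes, and further into training and test subsets. The subsets variable below will consist of four items in the following order:
- independent attributes of the training subset (used in .fit)
- independent attributes of the test subset (used in .predict)
- target attribute of the training subset (used in .fit)
- target attribute of the test subset (compared with the predictions made on x_test)
# Split independent and target features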
independ_attrs, target_attrs = helpers.independ_target_attr_split(df_data)
# Split train and test data subsets
subsets = train_test_split(independ_attrs, target_attrs, test_size=0.1, random_state=1)
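To make the ordering of the four subsets concrete, here is a small sketch on synthetic arrays. The variable names `x_train`, `x_test`, `y_train` and `y_test` are illustrative; how `helpers` and the model modules unpack `subsets` internally is not shown in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the independent and target attributes.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# Same call shape as above: 10% of the rows are held out for testing.
subsets = train_test_split(X, y, test_size=0.1, random_state=1)

# train_test_split returns the splits in this order.
x_train, x_test, y_train, y_test = subsets

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

Fixing `random_state` keeps the split reproducible, so both models are evaluated on the same held-out rows.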
Now we investigate two of the most popular methods used in machine learning, which differ in their approach to learning (lazy and eager learning). This will give us an idea of how other models in the same category might behave.
After a series of trials and errors, the following choice of hyperparameters seems to give the best possible results.
x_test. However, this doesn't provide us with suitable data to evaluate our metrics, so we include a default-probability threshold parameter in our model. The idea is simple: any client whose probability of defaulting is above the threshold is predicted to default, and any other client is not. On our particular data set the best results were obtained at a probability threshold of 0.35.
# time on
knn_start = timeit.default_timer()
# Classifier
knn_pred, knn_true = knn.classifier(
subsets=subsets,
n_neighbors = 7,
threshold = 0.35)
# time off
knn_stop = timeit.default_timer()
knn_time = knn_stop - knn_start
# Metrics
print("Our KNN analysis completed in {:.2f}s with the following scores...\n".format(knn_time))
print("accuracy_score : {:.4f}".format(metrics.accuracy_score(knn_true, knn_pred)))
print("precision_score : {:.4f}".format(metrics.precision_score(knn_true, knn_pred)))
print("recall_score : {:.4f}".format(metrics.recall_score(knn_true, knn_pred)))
print("f1_score : {:.4f}".format(metrics.f1_score(knn_true, knn_pred)))
helpers.confusion_matrix(true=knn_true, pred=knn_pred)
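The thresholding idea can be sketched directly with scikit-learn. This is an assumption about what a helper like `knn.classifier` might do, not its actual implementation, and the data comes from `make_classification` rather than our data set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced binary data (roughly 22% positives).
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.78, 0.22], random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1)

model = KNeighborsClassifier(n_neighbors=7).fit(x_train, y_train)

# Probability of the positive class (column 1 of predict_proba).
proba_default = model.predict_proba(x_test)[:, 1]

# Flag a client as defaulting when the probability exceeds the threshold.
threshold = 0.35
pred_thresholded = (proba_default > threshold).astype(int)

# For comparison: the plain majority vote among the 7 neighbours.
pred_majority = model.predict(x_test)

print(pred_thresholded.sum(), pred_majority.sum())
```

With 7 neighbours, a threshold of 0.35 flags a client when at least 3 neighbours defaulted instead of the 4 needed for a majority, so it can only add positive predictions. That typically trades some precision for recall.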
tol: The tolerance value of 0.000014 seems to produce the best results before our training overfits.
# time on
lg_start = timeit.default_timer()
# Classifier
lg_pred, lg_true = logreg.classifier(
tol=0.000014,
subsets=subsets,
data=df_data,
solver='liblinear')
lg_stop = timeit.default_timer()
lg_time = lg_stop - lg_start
# Metrics
print("Our logistic regression analysis completed in {:.2f}s with the following scores...\n".format(lg_time))
print("accuracy_score : {:.4f}".format(metrics.accuracy_score(lg_true, lg_pred)))
print("precision_score : {:.4f}".format(metrics.precision_score(lg_true, lg_pred)))
print("recall_score : {:.4f}".format(metrics.recall_score(lg_true, lg_pred)))
print("f1_score : {:.4f}".format(metrics.f1_score(lg_true, lg_pred)))
helpers.confusion_matrix(true=lg_true, pred=lg_pred)
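For reference, a minimal sketch of the same kind of fit using scikit-learn's `LogisticRegression` directly, where `tol` is the solver's stopping tolerance. The synthetic data and the `scores` dictionary are illustrative assumptions; the internals of `logreg.classifier` are not shown in this notebook.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for our data set.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1)

# Same solver and tolerance as in the call above.
model = LogisticRegression(solver="liblinear", tol=0.000014)
model.fit(x_train, y_train)
pred = model.predict(x_test)

# Collect the same four scores reported in this notebook.
scores = {
    "accuracy": metrics.accuracy_score(y_test, pred),
    "precision": metrics.precision_score(y_test, pred),
    "recall": metrics.recall_score(y_test, pred),
    "f1": metrics.f1_score(y_test, pred),
}
for name, value in scores.items():
    print("{:<10}: {:.4f}".format(name, value))
```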
To further improve our model, we could also consider using the Weight of Evidence (WoE) method, which can give us more control over the features we use during training. WoE consists of two steps:
The WoE transformation has the following advantages:
Nonetheless, caution is needed when implementing WoE, since this method also has some drawbacks:
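As a rough sketch of a WoE transformation for one categorical feature, the snippet below uses a common formulation: WoE = ln(share of non-events / share of events) per category, with the Information Value (IV) as a summary of predictive strength. Sign conventions vary between references, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data: one categorical feature and the binary target.
toy = pd.DataFrame({
    "education": [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    "defaulted": [0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1],
})

# Count events (defaults) and non-events per category.
grouped = toy.groupby("education")["defaulted"].agg(events="sum", total="count")
grouped["non_events"] = grouped["total"] - grouped["events"]

# Share of all events / non-events that falls into each category.
dist_events = grouped["events"] / grouped["events"].sum()
dist_non_events = grouped["non_events"] / grouped["non_events"].sum()

# WoE per category, and the Information Value of the feature.
grouped["woe"] = np.log(dist_non_events / dist_events)
iv = ((dist_non_events - dist_events) * grouped["woe"]).sum()

# Replace each category by its WoE value.
toy["education_woe"] = toy["education"].map(grouped["woe"])

print(grouped[["events", "non_events", "woe"]].round(3))
print("IV: {:.3f}".format(iv))
```

A category with zero events or zero non-events would make the logarithm undefined, so real data usually needs binning or smoothing before this step.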